scEWE: high-order element-wise weighted ensemble clustering for heterogeneity analysis of single-cell RNA-sequencing data

Abstract With the emergence of large amount of single-cell RNA sequencing (scRNA-seq) data, the exploration of computational methods has become critical in revealing biological mechanisms. Clustering is a representative for deciphering cellular heterogeneity embedded in scRNA-seq data. However, due to the diversity of datasets, none of the existing single-cell clustering methods shows overwhelming performance on all datasets. Weighted ensemble methods are proposed to integrate multiple results to improve heterogeneity analysis performance. These methods are usually weighted by considering the reliability of the base clustering results, ignoring the performance difference of the same base clustering on different cells. In this paper, we propose a high-order element-wise weighting strategy based self-representative ensemble learning framework: scEWE. By assigning different base clustering weights to individual cells, we construct and optimize the consensus matrix in a careful and exquisite way. In addition, we extracted the high-order information between cells, which enhanced the ability to represent the similarity relationship between cells. scEWE is experimentally shown to significantly outperform the state-of-the-art methods, which strongly demonstrates the effectiveness of the method and supports the potential applications in complex single-cell data analytical problems.


INTRODUCTION
Single-cell RNA sequencing (scRNA-seq) technology has rapidly gained attention since the first release in 2009.With the emergence of large amount of scRNA-seq data, the exploration of data analysis methods has become critical in revealing biological mechanisms.Among them, heterogeneity analysis of scRNAseq data builds the basis for downstream analysis by revealing cellular complexity, including cell type heterogeneity and the transcriptomic signatures.
In recent years, a great number of algorithms have been proposed [1][2][3][4] for addressing the problems of single-cell heterogeneity analysis.pcaReduce [5] presents an agglomerative clustering framework by combining principal component analysis (PCA) and k-means methods for inferring cellular hierarchies.[6] applies PCA algorithm and Laplacian transformation for dimension reduction of scRNA-seq data, and then employs cluster-based similarity partitioning algorithm (CSPA) to robustly analyze the heterogeneity embedded in scRNA-seq data.RaceID [7] assumes that different cell types robustly express some specific 'outlier' genes, achieving efficient identification of rare cell types in complex cell mixtures.Among the graph-based algorithms, CIDR [8] uses a simple implicit imputation method to mitigate the impact of data loss in single-cell transcriptome data, and then performs clustering based on the first few major coordinates with principal coordinates analysis.Celltree [9] is based on a powerful Bayesian statistical framework that represents cells as a statistical mixture of classes, allowing the capture of subtle evolutionary features along a continuum of soma and handling heterogeneous groups well.SCENIC [10] uses GENIE3 and GRNBoost (Gradient Boosting) to infer co-expression modules between transcription factors and candidate target genes, thereby providing a clearer data structure for downstream clustering.SIMLR [11] combines multiple kernels to learn the stable distance measure most suitable for the data structure to perform spectral clustering.SNN-Cliq as a graph based method [12] proposes a cluster-based clustering algorithm with a similarity measure based on shared nearest neighbors between cells, which can identify cell clusters of different densities and shapes.[2] proposed a hierarchical clustering framework based on new similarity measure using cell-pair correlation, showing robustness in cellular heterogeneity analysis for small scRNA-seq data sets.A kernel non-negative matrix factorization framework was further developed for dealing with relatively large size of scRNA-seq data in [4].
With the rapid development of deep learning, a lot of methods based on neural networks have been studied in recent years.ScDeepcluster [13] uses an autoencoder network to simultaneously learn low-dimensional feature representation and cluster assignment.ScDCC [3] encodes prior knowledge into constraints based on scDeepcluster method and integrated into the clustering process through a new loss function.scGNN [14] proposed a lefttruncated mixture Gaussian model to evaluate heterogeneous gene expression patterns and aggregate cell relationships with graph neural networks.scSemiGAN [15] builds a semi-supervised cell type annotation and dimensionality reduction framework based on the generative adversarial network.
Different methods yielded different results even for the same problems, as no existing method outperforms all the other competitors in all scenarios.Hence it is of great challenge to select the optimal method in single-cell heterogeneity analysis.Therefore, the idea of integrated learning offers a compelling alternative.A lot of works have been done based on weighted fusion of singlecell clustering results.SAFE [16] uses hyper-graph partitioning algorithm to integrate multiple clustering methods to build the final consensus partition.SAME [17] developed a mixture model ensemble clustering method to robustly analyze the cellular heterogeneity in scRNA-seq data.sc-GPE [18] calculates the Adjusted Rand Index (ARI) between the base clustering results, assigning greater weight to base clusterings that are more similar to other results.However, the weight strategy in the above ensemble representatives mainly considers base clustering as a whole, while neglects the element-wise contribution inside base-clustering.Besides, most existing methods ignore the high-order connection information between cells which may encode comprehensive structure information.In this paper, we propose a novel elementwise weighted ensemble method: scEWE, which introduces the first-order and second-order similarity in element-wise weight matrix construction and adaptively optimize to construct the final consensus co-association matrix to capture the high-order similarity relationship between cells as well as guarantee a robust ref lection on the relationship between base clusterings.Finally, the obtained weighted consensus matrix is incorporated into the spectral clustering framework with low rank representation to indicate the stable heterogeneity result for scRNA-seq data.

METHODS
We proposed a high-order Element-wise Weighted Ensemble learning model for single-cell data analysis(scEWE).The architecture of scEWE is shown in Figure 1.
scRNA-seq data usually have tens of thousands of genes in attribute space, with a large number of genes in low expression or undetected, which interferes with the correct identification of cell populations.Based on the characteristics of high-spasity, highnoise and high-dimensionality in scRNA-seq data, we assume a large number of attributes will make little contribution to the clear data representation of the data.On the other hand, highly variable genes are generally considered important for distinguishing cellular heterogeneity.We therefore select the top few highly variable genes with the largest variance to extract the rich information contained in the scRNA-seq data.In data preprocessing stage, we filtered the top 10% of genes with the largest variance.

Weighted co-association matrix
We let X = {x 1 , • • • , x N } be a dataset with N cells, where x i represents the i-th cell.We use P = π 1 , • • • , π M to represent the M base clusterings in the ensemble, where the M-th base clustering contains n m clusters.Moreover, cls m (x i ) denotes the cluster to which the i-th cell belongs in the m-th clustering.
The co-occurrence matrix A m for m−th base clustering is constructed as Traditional co-association matrix A is constructed by averaging One way to improve the representation ability of A is to assign weights β m to A m : However, due to the diversity of cell clusters, base clusterings that perform well on some cells may perform poorly on other cells.In order to better exploit the strength of base clusterings, we weight the base clusterings at the cell scale.
Specifically, we first construct the initial co-association matrix A according to Eq. ( 2), and then improve the representation of the co-association matrix by iterative weighting.Given the initial coassociation matrix A, we construct the τ -nearest neighborhood set of cell i as follows: where i 1 , • • • , i τ denote the column numbers corresponding to the τ largest elements in row i of A. Note that we base our framework on co-association matrix when constructing the cell neighborhood, which can make full use of global information.Intuitively, with the increasing of the similarity between the τnearest neighbor set of a cell and the cluster to which the cell belongs, the clustered cell becomes more reliable.We use the Jaccard coefficient as a similarity measure between sets, and the weight matrix W (1) is constructed as follows: where Figure 1.Flowchart for scEWE.
Here Q 1 and Q 2 represent two arbitrary sets, and | • | denotes the number of elements in the set.The larger W (1)  ij , the better the clustering effect of the j-th base clustering on the i-th cell.
To explore multi-scale associations between cells, we construct a second-order co-association matrix to describe the higher order similarity between cells.In real-world scenarios, if two people have many mutual friends, there is a high probability that the two people are acquainted.Inspired by this phenomenon, two cells that are very similar to each other should also belong to the same type.Therefore, the similarity information in the co-association matrix is regarded as a highly condensed sample feature.The Gaussian radial basis function is used to calculate the similarity between cells: where xi represents the i-th row data of the initial co-association matrix A (Eq. ( 2)), xi .Then, we obtain the second-order co-association matrix A (2) according to the cell similarity: Denote where i 1 , • • • , i τ denote the column numbers corresponding to the τ largest elements in row i of A (2) .The second-order weight matrix W (2) is constructed accordingly: We express the final weight matrix W as a linear combination of the first-order and second-order weight matrices: The parameter λ balances the proportion of the first-order and second-order weight matrices.For the convenience of calculation, the matrix W was represented in the form of column vectors W = {w 1 , w 2 , • • • , w M }, where w i is the i-th column vector of W. The weighted co-association matrix S is computed as where denotes the matrix dot product, and(•) γ represents element-wise exponentiation.The scale parameter γ ∈ [0, ∞) measures the importance of the weighting method in the ensemble process.If we want to enhance the inf luence of cell weights on the weighted co-association matrix S, we should set a larger γ .
We iteratively update the weighting matrix as represented by W = {w 1 , w 2 , • • • , w M } to ensure a stable S. Specifically, for S i we follow the steps of Eqs. ( 4)- (11) to generate updated weight matrix

Optimization framework
After obtaining the weighted co-association matrix S, a common practice is to use the spectral clustering algorithm to output the clustering results, since S can be viewed as the similarity matrix of the cells: where L S stands for the normalized Laplacian matrix of S, and F represents the clustering result matrix.However, noise in single-cell data often disrupts the data structure, making the clustering results significantly different from the real cell clusters.We regard attribute matrix for each cluster form a subspace of the feature space, and introduce the idea of subspace clustering for low-rank recovery.Here, the cells polluted by noise are considered outliers.In order to make the weighted co-association matrix S have a clearer structure for more accurate clustering results, we iteratively generate a block-diagonal matrix Y N×N based on robust spectral ensemble clustering [19].Given S, a self-representation SY on a low-rank subspace is used to approximately reconstruct S with reconstruction error R. Y is a dictionary coefficient matrix, which is used to extract important features of S. It is worth mentioning that the existing theory has proved that when S is used as the dictionary of S itself, the dictionary coefficient matrix Y has excellent properties to facilitate the subsequent clustering.These properties include low rank, block diagonal and approximate symmetry.The optimization function is as follows: where Here, D Y stands for the degree matrix of (Y + Y T )/2 + FF T .The purpose of constructing the Laplacian matrix using (Y+Y T )/2+FF T instead of S is to strengthen the block diagonal structure of the cell similarity matrix during optimization.

Optimization of cluster number k
It is of critical significance to give an accurate estimation of cluster number.Inspired by [2], we determined the number of clusters k based on variance analysis.Variance analysis describes the contribution of variation from different sources to the total variation.We regarded the inter-cluster difference after clustering as the inherent difference of different cell clusters (SSW) and regarded the difference between intra-cluster as SSB.Suppose that cells are clustered into K clusters, where the k-th cluster has n k cells.As before, we used cls(x i ) to indicate the cluster to which the i-th cell belongs.Define the k-th cluster center as follows: where 1(•) is an indicator function.Therefore, accordingly, An effective cluster partition usually results in large intercluster distance and small intra-cluster distance.Therefore, we attempt to obtain better cluster partitions by making Rate = SSB SST as large as possible.However, as the number of cell clusters increases, the Rate value usually increases monotonically.We therefore used the growth rate as a criterion and the optimal cluster number is achieved by solving the following optimization problem: where Rate i (k) denotes the Rate value of the i-th feature when the number of clusters is specified as k.
The value of q in the above equation represents the number of top differentially expressed genes.
6: end while 7: Obtain F by optimize Eq. ( 14).8: Perform spectral clustering on FF T to get the final result.
Finally, F learns the cluster partition of cells, and we perform spectral clustering on FF T to get the final clustering result.We summarize the procedure details in Algorithm 1.

Performance evaluation
We use two common indicators, ARI and Nomalized Mutual Information (NMI), to evaluate the performance of scEWE and compared methods.Suppose P = {P 1 • • • P s } represents the predicted label set and Then the ARI is calculated as follows: where a i is the number of cells labeled P i , b j is the number of cells labeled T j .The range of ARI is [−1, 1], and a larger value indicates a better clustering result.NMI uses information entropy theory to describe the similarity of label set distribution, which is calculated as follows: The range of NMI is [0, 1], and a larger value indicates a better clustering result.

SHARP
As an algorithm for processing large-scale single-cell data, the SHARP method [20] consists of three steps: dividing the cells into different groups, using a random projection algorithm for dimensionality reduction for each group and weighting the results after dimensionality reduction.Finally, SHARP obtains the final clustering result according to the similarity relationship between cells in the consensus matrix.

SIMLR
SIMLR [11] uses different distance metrics to construct a nuclear similarity matrix for cells, determines the weight by visualizing each nucleus and then uses spectral clustering to learn cell partitions.

CIDR
CIDR [8] uses a simple implicit imputation method to mitigate the impact of dropout in scRNA-seq data, followed by clustering based on the first few principal coordinates.

SC3
SC3 [21] uses PCA and Laplacian transformation for dimension reduction processing, and then uses CSPA to partition the consensus matrix.Despite the slow speed when integrating large-scale data sets, SC3 is still one of the mainstream algorithms for singlecell clustering due to its superior performance.

Seurat
Seurat [22] first calculates the Euclidean distance of the k nearest cells to each cell in the space after PCA dimensionality reduction, and constructs a k-NN graph.It then refines the edge weights between any two cells based on their shared overlap in their local neighborhood, attempts to partition the graph into highly interconnected communities and finally applies a modular optimization technique (Louvain) to group cells.

SAFE
SAFE [16] is an ensemble clustering method for scRNA-seq data.
It embeds four state-of-the-art methods, SC3, CIDR, Seurat and t-SNE+k-means for ensemble learning, and used three hypergraphbased partitioning algorithms for final clustering result integration.

SAME
SAME [17] is a mixture model-based approach for scRNA-seq data clustering.SC3, CIDR, Seurat, t-SNE + k-means and SIMLR are the five base clustering algorithms.Normalization and transformation of scRNA-seq data are executed in the initial stage.A subset of four diverse sets of clustering solutions are then combined for final cluster ensemble.

Experiments
We used publicly available datasets from published papers to test the performance of scEWE, and the details of the datasets are shown in Table 1.In data preprocessing stage, we first filtered the top 10% of genes with the largest variance.
In the following text, the five single-cell clustering algorithms: SHARP, SIMLR, CIDR, SC3 and Seurat were used to generate base clusterings for our ensemble learning framework.We generated a candidate clustering pool(composed of 50 base clusterings) by constructing 10 base clusterings for each of the five clustering algorithms, through perturbation on the top 10% selected genes with 10 genes taken as intervals, and the parameters are set as γ = 0.5, σ 1 = 1000, σ 2 = 100 and the maximum number of iterations to 50.Specifically, the 10 base clustering results are generated as follows.Assuming the number of genes in the scRNA-seq dataset is m, the base clustering results were generated on the filtered dataset with a number of selected variable genes, where the top variable gene is measured with largest variance in gene expression across all the cells.The number of selected variable genes n g (i) in the i-th base clustering result can be represented in the following formula: where [m/10] is the nearest integer around m/10.The intuition behind the generation process is as follows.The scRNA-seq data set is sparse and noisy.Hence for each dataset, we use variable gene expression to represent the characteristics of the cells; however, usually different clustering results will be obtained by taking different numbers of variable genes.It is unclear what is the appropriate number of genes that should be involved in the clustering.Therefore, we generate 10 base clustering results for the given dataset with varying numbers of variable genes, which are used for further ensemble learning and may lead to better and more robust results.It is worth mentioning that we take the seed number of 5 for all the random parts in the algorithm, and still take this seed number for comparing methods when comparing the results with the base clustering method.The parameter λ in Eq. ( 11) plays an important role in balancing the first-order weight matrix and the second-order weight matrix.Motivated by variance analysis, we introduce the measure below to decide an optimal λ: Assume there are c cell sub-populations and each cell belongs to one and only one subpopulation.Treating each cell subpopulation as a treatment group, we can define SSB(λ) and SST(λ) analogously as Eq. ( 18).If c cell populations are well separated, SSB(λ) SST(λ) is likely to be large.Therefore, the optimal parameter λ * is determined when SSB(λ) SST(λ) achieves the maximum.
In order to reduce the computational complexity, when using high-order information, we first filter the base clustering instead of using all the base clustering for ensemble.We preliminarily ran our ensemble algorithm and summed the weight matrix W to evaluate the importance of the base clustering by score j = N i=1 W ij , j = 1 • • • 50.The top five base clusterings with the largest score are finally determined and integrated to get the final clustering result.Regarding the parameter τ , the optimal choice of the number of neighbors for each cell is the number of cells in the cluster to which it belongs.We selected larger τ for the dataset with more genes to better utilize the rich information of cell-tocell relationships.Therefore, we set the value of τ according to the following formula τ = α , where Here N is the number of cells in the dataset, q is the number of genes and k is the cluster number of cells.α is the data scale parameter, since more cells in large-scale datasets should be selected to construct cell neighborhoods, so we set We compared our method with seven state-of-the-art singlecell clustering methods.Among them, SAFE and SAME are ensemble clustering algorithms developed for single cell.For the five base clustering algorithms, we use the default parameters in both individual setting and ensemble learning.For SAFE and SAME, we also use the default parameters of the methods.It should be noted that cluster number is a critical parameter for the algorithms.In SAFE, SAME and our proposed scEWE where the number of clusters can be determined adaptively, we take the number of clusters determined by the algorithm as input.For the five base clustering methods as comparison methods, we use the real number of clusters for them.Seurat method relies on resolution to determine the number of clusters, and we set the resolution to 1.5.In the SHARP and SC3 method, the seed number is set to 5 uniformly.For the other parameters in Seurat, we specify scale.factor= 10 000, nfeatures=1000, npcs=40 and dims in FindNeighbors=1:10.The results are shown in Tables 2 and 3. SAFE as an ensemble clustering method for scRNA-seq data can demonstrate superiority to SHARP, SIMLR, CIDR and Seurat.In the ensemble comparison partners, SAFE shows better performance in Goolam data, compared with SC3 and SAME.There are no dominant methods that can show best performances in SAFE, SAME and SC3.These methods cannot compete with scEWE in the considered datasets.
In order to compare the embedding capability of the feature space extracted by different methods, we performed t-stochastic neighborhood embedding(tSNE) to visualize the data on a twodimensional space.For methods that perform clustering based on the consensus matrix (SIMLR, SC3 and scEWE), we use the consensus matrix as input for the tSNE visualization.For methods (CIDR, SHARP and Seurat), we use the embedding matrix obtained for input.Since the SAME and SAFE directly outputs the clustering results, it is hard to evaluate the embedding or modeling capability of the methods and hence incapable for us to show tSNE visualizations for these methods.To visualize the clustering results, we provided UMAP visualizations for all the clustering results by the considered methods.We directly reduce the dimensionality of all datasets by UMAP, and then color the cells according to clustering results of different methods.They are attached in the supplementary file.It can be shown that scEWE helps provide a better clustering result compared with other methods.Besides, we provided the UMAP visualization results of our method and the five base clustering methods for all the datasets in the embedding capabilities and the results were also attached in the supplementary file.
The tSNE visualizations for all the six datasets are included in the supplementary file, where four representative datasets (see Figures 2 to 5) are shown for illustration purpose.
In the Usoskin dataset (Figure 2), CIDR does not capture cellular heterogeneity where different cell types are mixed together.For SHARP, cells of the same type are tightly scattered.Although SIMLR clearly separates cells in different clusters, certain type of cells are divided into two distinct clusters.SC3, Seurat and scEWE showed relatively better descriptions of cellular relationships.In particular, scEWE best captures the relationship between cells, where the same type of cells are tightly clustered and different cell types are well separated.Similar results can be revealed in Figures 3 to 5 that scEWE is among the best method for deciphering the heterogeneity in the considered scRNA-seq data sets.
We showed the number of clusters predicted by our method in the six datasets, as shown in Figure 6.It is clear that our estimation of the number of cell clusters is relatively robust.
For the single-cell heterogeneity analysis, we used five methods (SHARP, SIMLR, CIDR, SC3 and Seurat) as base clustering, and got an ensemble clustering result (scEWE) with excellent performance.In addition, we proposed a method for estimating the number of cell clusters based on the variance analysis.Experiments have proved that our estimation is very consistent with the Element-wise Weighted Ensemble | 7  real number of cell clusters under the premise that the number of clusters in the base clusterings is accurate.Some widely used base clustering methods such as CIDR showed unsatisfactory performance on some datasets, hence we tried to analyze the special properties in the considered data sets.We performed tSNE to visualize the original scRNA-seq data to have a understanding on the data distributions, shown in Figure 7(A).It can be seen that most of the data sets were quite noisy where different cell types mixed together, bringing difficulties in correct distinction of cellular heterogeneity.We also evaluated the sparsity of the data by computing the nonzero ratio in each attribute, and see that a number of the  attributes are sparsely distributed, shown in Figure 7(B).In particular, the attribute sparsity was clearly shown in Brain, Treulein and Usoskin data.We found that CIDR showed poor performance in these datasets, the possible explanations might be that CIDR was quite sensitive to the sparsity of the data.SHARP, SIMLR and Seurat were less sensitive to the data sparsity compared with CIDR; however, we can see that in Treulein data when almost all the attributes were sparsely distributed, these methods failed to capture the inherent relationship in the data.

Parameter sensitivity analysis
Aiming at investigating the impact of parameters on the performances of the proposed algorithms, we performed a sensitivity analysis on the parameters γ , σ 1 and σ 2 .γ is the weighting scale parameter in the element-wise weighting module, which lies in the range [0, +∞).As γ approaches 0, the cell weights play a negligible role in the construction of the co-association matrix, and vice versa.In fact, by adjusting γ , we can ensure a proper inf luence on the co-association matrix.σ 1 and σ 2 are the parameters in the optimization module, which represent the contribution of the regularization term in each iteration.In the  experiments, the scale parameters γ ∈ [0, 10], σ 1 ∈ [10 −6 , 10 6 ] and σ 2 ∈ [10 −6 , 10 6 ] are tested.We show the performance of the model in Figures 8 and 9 as these parameters vary.It is observed that our algorithm exhibits sufficient robustness to the parameter selection.For applications, the parameters γ between [0.5, 1] and σ i (i = 1, 2) between [10 −3 , 10 3 ] are preferable as tested to ensure stability and acceptable performance on all the involved datasets.

High-order information in scEWE
To check the inf luence of high-order information, we conducted ablation study to our proposed model.We compared the performance of scEWE with and without high-order information in weight matrix construction.In scEWE with high-order information, we first adaptively determine the optimal λ ∈ (0, 1), and then use the integrated weight matrix for ensemble learning.The results are shown in the table below.It can be seen that high-order information contribute to the performance in a positive manner.
In addition, we check the performance of the model when the value of λ varies.The results are shown in Figure 10, where the adaptively selected optimal λ for each data set is marked by a bullet.In Figure 10, λ = 1 and λ = 0 indicate the cases where only first-order information and only second-order information is used, respectively.It is shown that the determined λ can help achieve optimal or near-optimal performance on the considered datasets.For Deng, Goolam and Usoskin datasets, the model using only first-order information (λ = 1) performs a significant reduction compared with those cases using high-order information (λ < 1).These observations are consistent with the results shown in Table 4 that integration of second-order information would make positive contribution to the model performance.Besides, more observation can be indicated from Figure 10.Particularly, if only second-order information is used (corresponding to λ = 0), the performance of the model is better than that using only firstorder information for most datasets.Possible reasons might be that using high-order information can generally capture deep geometric connections between cells.However, an exception is observed for the Goolam dataset, where the clustering performance of the case using only second-order information is inferior to that of other cases where lower order information is used.Therefore, it is beneficial to integrate the first-order and second-order information in a weight-adaptive manner.This observation confirms the reasonability of our model.

Computational complexity
In order to check the practical applicability of scEWE in diverse scenarios, we conducted analysis detailing the runtime and memory demands.This will help provide a guidance to choose the appropriate base clustering methods for ensemble learning in the real application.We recorded the computational time and memory required to run the base clustering algorithms; the algorithms were run on Windows 10 system with 128GB memory, Inter(R) Xeon(R) Gold 5218 CPU 2.30GHz, and the results were reported in the Tables 5 and 6.
We now analyze the time complexity in the ensemble learning.Assume that the number of cells is n and the number of base clustering methods is m.The co-association matrix initialization takes O(n 2 ) time.High-order information extraction involving Gaussian kernel construction has a time complexity of O(n 2 ).The time complexity of generating a weighted co-association matrix by weight matrix is O(mn 2 ).Assuming that the process takes t 1 iterations, the time complexity in the weighted coassociation matrix construction can be represented as O([t 1 (m + 1)]n 2 + 2t 1 τ mn).Assuming spectral ensemble clustering algorithm takes t 2 iterations to converge, the time complexity of the ensemble learning part can be roughly represented as O(t 2 n 3 ).The computational time and memory requirement for scEWE were shown in Tables 5 and 6.From a computational efficiency perspective, scEWE is efficient when the number of cells in the scRNA-seq dataset is small.However, when the number of cells is relatively large, scEWE shows limited improvement as the ensemble learning sacrifices the computational time to ensure better heterogeneity analysis.

Cluster number in scEWE
Cluster number is a critical parameter which may inf luence the model performance.When the estimated number of cell clusters deviates from the ground truth, we conducted experiments to check how the ensemble clustering results will be affected.We varied the number of clusters from 2 to 15 to report the model performance.The results are presented in Tables 7 and 8.We use bold font to indicate the ARI and NMI values corresponding to scEWE with predicted cluster numbers, and underlines to indicate the results with real cluster number.Through the experiments, we have the following interesting discoveries.First of all, scEWE is relatively robust to the number variations in the three datasets Brain, Deng and Goolam.In Brain data, the real cluster number is 8.We can see from Tables 7 and 8 that the performance of scEWE is stable when cluster number is at the neighborhood of 8.In Goolam dataset, similarly, the performance of scEWE is stable when cluster number is at the neighborhood of the real cluster number 5. When the estimated number of cell clusters deviates from the ground truth, we can still get good clustering results with our predicted cluster number.For the other three datasets, the model performance is sensitive to the cluster number.It is therefore of critical significance to give a good estimation on the cluster number.When we compare the performance of scEWE with predicted cluster number to scEWE with real cluster number, we have the following findings.
Our proposed variance-analysis-based cluster number estimation method can well capture the distribution characteristics of the dataset.For the Treulein dataset, the true number of clusters is three, while the cluster number estimated by our prediction method is four.The ARI and NMI values of scEWE corresponding to the predicted cluster number though are not globally optimal; they are higher than the corresponding performances under the true number of clusters.For Biase data and Usoskin data, where the real cluster numbers are three and four, respectively, our method can provide accurate cluster number estimations.
The performances of scEWE with the predicted cluster number are also the best among the other candidate cluster numbers.These findings suggest the effectiveness of our cluster number method.

Generalization ability
It is necessary to have a clear picture on the property of scEWE if it is robust when the base clustering showed bad performances.We therefore analyzed the impact of base clustering on our ensemble model.On the one hand, we tested whether dropping the worst base clustering result can further improve the ensemble performances.On the other hand, we analyzed whether good base clustering would contribute to the better performance of the ensemble model.'Good base clustering' is denoted if 5% of the cells are wrongly labeled, and 'perfect base clustering' is denoted if all the cells are accurately labeled.We therefore replaced the worst base clustering respectively with 'good base clustering' and 'perfect base clustering' in turn, respectively, to check the inf luence of them to scEWE.The results for the two scenarios of replacement are reported in Tables 9 and 10, respectively.It is interesting to see that scEWE is robust when the base clustering showed bad performance.When we dropped the worst base clustering, the ensemble model remained stable in Biase, Brain, Goolam and Usoskin data.In Deng and Treulein data however, the ensemble model showed degraded performance.Possible explanations might be that the worst base clustering still contains useful information where the ensemble model scEWE can extract.On the other hand, when we replaced the worst base clustering with 'good base clustering' and 'perfect base clustering', respectively, scEWE can benefit from the clustering results and showed improved performances.
The rapid advancement of scRNA-seq technologies enables the generation of more and more large-scale datasets, bringing new challenges in computational cost and effectiveness.scEWE as an element-wise weighted ensemble clustering fully considers the neighborhood relationship between cells, and improves the clustering accuracy at the cost of computational complexity.The advantage of scEWE lies in that it is a very f lexible framework in  the two considered datasets.Our experiments demonstrated that scEWE incorporating high-order information shows limited better performance for large-scale data with a sacrifice of significant computation burden.Comparing the results for small-and largescale data, we have the following findings.If the scale of scRNAseq data is small, the first-order information provides considerably limited understanding for data relationship, and thus, the integration of high-order information is necessary to learn a deep geometric relationship.This reveals the inherent principle of the proposed method.However, if the scale of scRNA-seq data is relatively large, incorporating high-order information can only make limited performance improvement because the first-order information can provide sufficient understanding for establishing an appropriate element-wise weighted co-association matrix.
Considering the high computation cost and limited performance improvement for the proposed scheme incorporating high-order information, element-wise weighted ensemble clustering with first-order information might be preferable for the scenario of large-scale scRNA-seq data.
As an ensemble algorithm, our method has good scalability and generalization ability.Users can specify any of the five clustering methods for integration.For the clustering results generated by other single-cell methods, we can also input them as base clusterings into our ensemble framework.To verify the effectiveness of our algorithm, we applied scEWE to five additional clustering tasks.The five datasets including Glass, IS, MNIST, Texture and Ionosphere (Table 17) are publicly available and can be obtained from [31].Following [31], we constructed a large pool of candidate base clusterings, and each base clustering was generated by the k-means algorithm, and the number of clusterings was randomly selected in the range [2, √ N], where N is the number of samples.We generated 100 base clusterings for each dataset to form the base clustering pool.In each experiment, 10 base clusterings were chosen in a perfectly equally probable manner.Compared with five methods [31], [32] and [33], we provided the mean performance (NMI and ARI) and standard deviation and marked the best score for each dataset in bold.Similar to the above, we set λ = 1, τ = 5, γ = 0.5.After sample weighting by our method to get the consensus matrix, we used hierarchical clustering to generate the final clustering results.The results were shown in Tables 18 and 19, and it can be seen that our method exhibits excellent performance in this challenging test.

Comparison with deep learning algorithms
To better demonstrate the superiority of scEWE, we compared scEWE with efficient deep learning methods scDeepCluster and scGNN.Since scDeepCluster and scGNN allow only count data for processing, if the dataset was not count data, necessary preprocessing was conducted to ensure the implementation of the algorithms.Besides, deep clustering methods such as SDCN [34] and DFCN [35], which integrate high-order structural information into clustering processes, were also introduced for method comparison.For scDeepCluster, we set the number of clusters to the actual number of clusters, the number of pre-trained epochs to 700, the maximum number of iterations to 5000 and the remaining parameters to the default values.For scGNN, we use 'Do not infer LTMG mode' with default parameters.In the construction of k-nearest neighbor graph, k is set to 100 and the maximum number of iterations of the model is set to 2000.For SDCN, we set the number of input nodes to the number of genes in the dataset, the number of clusters to the number of real classes, the training batches to 800, the k value in the k-nearest neighbor map to 50 and the remaining parameters to their default values.For DFCN, we set the training epoch to 7000, set the number of clusters to the number of real classes, set the number of input nodes to 100 and use the default values for the rest of the parameters.The results were shown in Table 20.scDeepCluster demonstrated better performances among the considered deep learning methods.When the number of cells was relatively small, most of the deep learning methods seemed to perform in an unsatisfactory manner.In Klein or Baronh data set, where the number of cells was 2717 and 8569, respectively, scGNN and scDeepCluster showed good performances in capturing the heterogeneity embedded in the single-cell data.In comparison, SDCN and DFCN cannot compete with scGNN and scDeepCluster that are particularly designed for scRNA-seq data.From the results we can see that scEWE can still demonstrate its superiority and effectiveness in dealing with noisy scRNA-seq data.

Understanding biological process
To further explore the capability of scEWE in revealing biological process, we investigated on the biological role of the marker genes that scEWE extracted.We performed heterogeneity analysis on the dataset obtained from [38] containing 124 individual cells in various developmental stages from human preimplantation embryos.The major type of cells identified by scEWE was used for further marker gene analysis.The top 10 differentially expressed genes are 'TERF1', 'HESRG', 'CD24', 'PRDX6', 'AASS', 'SEPHS1', 'NUCKS1', 'M6PR', 'PHF17', 'TUBB'.We performed functional enrichment analysis with Metascape (https: //metascape.org).Pathway and process enrichment analysis has been carried out with Gene Ontology Biological Processes.The top extracted biological process is shown in Figure 11.Take a further look at the marker genes, through Genecards (https:/www.genecards.org/)we found that 'TERF1' as Telomeric Repeat-Binding Factor 1 involves in the biological processes of cell cycle, cell division and mitosis.'HESRG' is Embryonic Stem Cell Related Protein, and 'CD24' has a pivotal role in cell differentiation of different cell types.These findings suggest the roles of genes involving in cell development such as DNA replication, cell differentiation, etc.

CONCLUSIONS
In this paper, we have proposed a high-order element-wise weighted ensemble method of heterogeneity analysis for scRNAseq data: scEWE.Different from traditional weight strategy in ensemble clustering, scEWE novelly incorporates element-wise contribution in each base clustering for weighted co-association matrix construction.A low-rank self-representation framework is incorporated for generating final heterogeneity results.Compared with state-of-the-art methods for scRNA-seq data analysis, scEWE shows robustness as well as superiority in performance.Due to the high complexity of the algorithm, it is worth noting that our method has limited capacity for datasets involving very large number of cells.How to optimize the algorithm to improve the computational efficiency is the work we will consider in the future.

Key Points
• We developed a novel ensemble learning framework: scEWE to deal with heterogeneity analysis problem for scRNA-seq data.• A high-order element-wise weighting strategy was proposed for building the ensemble learning framework.
• Variance-analysis-based low-rank self-representation optimization model was applied for latent embedding and heterogeneity analysis.• The effectiveness of scEWE was demonstrated through extensive experiments in real-world datasets.

Figure 2 .
Figure 2. tSNE visualization of embedding capability for the Usoskin dataset.Subfigures correspond to SHARP, SIMLR, CIDR, SC3, Seurat and scEWE.The different colors represent different clusters in true labels.

Figure 3 .
Figure 3. tSNE visualization of embedding capability for the Goolam dataset.Subfigures correspond to SHARP, SIMLR, CIDR, SC3, Seurat and scEWE.The different colors represent different clusters in true labels.

Figure 4 .
Figure 4. tSNE visualization of embedding capability for the Deng dataset.Subfigures correspond to SHARP, SIMLR, CIDR, SC3, Seurat and scEWE.The different colors represent different clusters in true labels.

Figure 5 .
Figure 5. tSNE visualization of embedding capability for the Treulein dataset.Subfigures correspond to SHARP, SIMLR, CIDR, SC3, Seurat and scEWE.The different colors represent different clusters in true labels.

Figure 6 .
Figure 6.Optimal cluster number determination in different datasets.

Figure 7 .
Figure 7. Data distributions in the considered data sets.The upper figures correspond to the tSNE plots in the original data sets; the lower figures correspond to the nonzero ratio distribution of attributes in the data sets.

Table 1 :
Datasets information

Table 2 :
Performance comparison of different methods in ARI on the considered datasets (the best score in each row is highlighted in bold)

Table 3 :
Performance comparison of different methods in NMI on the considered datasets (the best score in each row is highlighted in bold)

Table 4 :
Performance comparison of scEWE with and without high-order information (./.) represents the ARI and NMI value, respectively

Table 5 :
Runtime comparison on the considered datasets

Table 6 :
Memory (MB) requirement comparison on the considered datasets

Table 11 :
Large-scale datasets information

Table 12 :
Performance comparison in Klein dataset for each method

Table 13 :
Performance comparison in Baronh dataset for each method

Table 14 :
Runtime comparison for large-scale datasets s represents second, m represents minute, h represents hour

Table 15 :
Memory (MB) requirement comparison on large-scale datasets

Table 16 :
Performance comparison of scEWE in large datasets with and without high-order information (./.) represents the ARI and NMI value, respectively

Table 17 :
Datasets information

Table 18 :
Performance evaluated in ARI on extended datasets (the best score in each row is highlighted in bold)

Table 19 :
Performance evaluated in NMI on extended datasets (the best score in each row is highlighted in bold)

Table 20 :
Performance comparison with deep learning methods Figure 11.Top biological process item in [38] by Metascape.